580 research outputs found

    Automatic document classification of biological literature

    Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test set (Reuters 21578) than previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
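
    The F-values quoted in this abstract are easier to compare with the definition in hand. The sketch below assumes that "F-value" means the balanced F-measure, the harmonic mean of precision and recall. The abstract does not say which variant was used and reports no precision or recall figures, so the function name `f_measure` and the input numbers are illustrative assumptions only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        # Both inputs are zero: define the score as 0 instead of dividing by zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper.
    print(round(f_measure(0.60, 0.50), 3))
```

    Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot reach a high F-value by maximizing precision or recall alone.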

    An overview of the BioCreative 2012 Workshop Track III: interactive text mining task

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
    The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

    WormBase 2012: more genomes, more data, new website

    Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.

    Acknowledgements

    In this book, an international team of fourteen scholars investigates the Chinese reception of Indian Buddhist ideas, especially in the sixth and seventh centuries. Topics include Buddhist logic and epistemology (pramāṇa, yinming); commentaries on Indian Buddhist texts; Chinese readings of systems as diverse as Madhyamaka, Yogācāra and tathāgatagarbha; the working out of Indian concepts and problematics in new Chinese works; and previously under-studied Chinese evidence for developments in India. The authors aim to consider the ways that these Chinese materials might furnish evidence of broader Buddhist trends, thereby problematizing a prevalent notion of “sinification”, which has led scholars to consider such materials predominantly in terms of trends ostensibly distinctive to China. The volume also tries to go beyond seeing sixth- and seventh-century China primarily as the age of the formation and establishment of the Chinese Buddhist “schools”. The authors attempt to view the ideas under study on their own terms, as valid Buddhist ideas engendered in a rich, “liminal” space of interchange between two large traditions.

    About the Authors

    In this book, an international team of fourteen scholars investigates the Chinese reception of Indian Buddhist ideas, especially in the sixth and seventh centuries. Topics include Buddhist logic and epistemology (pramāṇa, yinming); commentaries on Indian Buddhist texts; Chinese readings of systems as diverse as Madhyamaka, Yogācāra and tathāgatagarbha; the working out of Indian concepts and problematics in new Chinese works; and previously under-studied Chinese evidence for developments in India. The authors aim to consider the ways that these Chinese materials might furnish evidence of broader Buddhist trends, thereby problematizing a prevalent notion of “sinification”, which has led scholars to consider such materials predominantly in terms of trends ostensibly distinctive to China. The volume also tries to go beyond seeing sixth- and seventh-century China primarily as the age of the formation and establishment of the Chinese Buddhist “schools”. The authors attempt to view the ideas under study on their own terms, as valid Buddhist ideas engendered in a rich, “liminal” space of interchange between two large traditions.

    Index

    In this book, an international team of fourteen scholars investigates the Chinese reception of Indian Buddhist ideas, especially in the sixth and seventh centuries. Topics include Buddhist logic and epistemology (pramāṇa, yinming); commentaries on Indian Buddhist texts; Chinese readings of systems as diverse as Madhyamaka, Yogācāra and tathāgatagarbha; the working out of Indian concepts and problematics in new Chinese works; and previously under-studied Chinese evidence for developments in India. The authors aim to consider the ways that these Chinese materials might furnish evidence of broader Buddhist trends, thereby problematizing a prevalent notion of “sinification”, which has led scholars to consider such materials predominantly in terms of trends ostensibly distinctive to China. The volume also tries to go beyond seeing sixth- and seventh-century China primarily as the age of the formation and establishment of the Chinese Buddhist “schools”. The authors attempt to view the ideas under study on their own terms, as valid Buddhist ideas engendered in a rich, “liminal” space of interchange between two large traditions.

    Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR

    WormBase, dictyBase and The Arabidopsis Information Resource (TAIR) are model organism databases containing information about Caenorhabditis elegans and other nematodes, the social amoeba Dictyostelium discoideum and related Dictyostelids and the flowering plant Arabidopsis thaliana, respectively. Each database curates multiple data types from the primary research literature. In this article, we describe the curation workflow at WormBase, with particular emphasis on our use of text-mining tools (BioCreative 2012, Workshop Track II). We then describe the application of a specific component of that workflow, Textpresso for Cellular Component Curation (CCC), to Gene Ontology (GO) curation at dictyBase and TAIR (BioCreative 2012, Workshop Track III). We find that, with organism-specific modifications, Textpresso can be used by dictyBase and TAIR to annotate gene products to GO's Cellular Component (CC) ontology.

    Publishing Interactive Articles: Integrating Journals And Biological Databases

    In collaboration with the journal GENETICS, we've developed and launched a pipeline by which interactive full-text HTML/PDF journal articles are published with named entities linked to corresponding resource pages in WormBase (WB; http://www.wormbase.org/). Our interactive articles allow a reader to click on over ten different data type objects (gene, protein, transgene, etc.) and be directed to the relevant webpage. This seamless connection from the article to summaries of data types promotes a deeper level of understanding for the naïve reader, and incisive evaluation for the sophisticated reader. Further, this collaboration allows us to identify and collect information before the publication of the article. The pipeline uses automated recognition scripts to identify entities that already exist in the database and a self-reporting form we created at WB that is sent to the author by GENETICS for submitting entities that do not already exist in our database. We include a manual quality control step to make sure ambiguous links are corrected, and that all new entities have been reported and linked properly. The automated entity recognition scripts allow us to potentially link any object found in a database as well as to expand this pipeline to other databases. We have already adapted this pipeline for linking Saccharomyces cerevisiae GENETICS articles to the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) and are currently expanding this pipeline for linking genes in Drosophila articles to FlyBase (http://flybase.org/). By integrating journals and databases, we are integrating the major modes of communication in the biological sciences, which will undoubtedly increase the pace of discovery.

    Functional principal component analysis for identifying multivariate patterns and archetypes of growth, and their association with long-term cognitive development

    For longitudinal studies with multivariate observations, we propose statistical methods to identify clusters of archetypal subjects by using techniques from functional data analysis and to relate longitudinal patterns to outcomes. We demonstrate how this approach can be applied to examine associations between multiple time-varying exposures and subsequent health outcomes, where the former are recorded sparsely and irregularly in time, with emphasis on the utility of multiple longitudinal observations in the framework of dimension reduction techniques. In applications to children's growth data, we investigate archetypes of infant growth patterns and identify subgroups that are related to cognitive development in childhood. Specifically, "Stunting" and "Faltering" time-dynamic patterns of head circumference, body length and weight in the first 12 months are associated with lower levels of long-term cognitive development in comparison to "Generally Large" and "Catch-up" growth. Our findings provide evidence for the statistical association between multivariate growth patterns in infancy and long-term cognitive development.